The smarty4covid dataset and knowledge base as a framework for interpretable physiological audio data analysis

Harnessing the power of Artificial Intelligence (AI) and m-health to detect new bio-markers indicative of the onset and progress of respiratory abnormalities/conditions has attracted great scientific and research interest, especially during the COVID-19 pandemic. The smarty4covid dataset contains audio signals of cough (4,676), regular breathing (4,665), deep breathing (4,695) and voice (4,291) as recorded by means of mobile devices following a crowd-sourcing approach. Other self-reported information is also included (e.g. COVID-19 virus tests), thus providing a comprehensive dataset for the development of COVID-19 risk detection models. The smarty4covid dataset is released in the form of a web-ontology language (OWL) knowledge base enabling data consolidation from other relevant datasets, complex queries and reasoning. It has been utilized towards the development of models able to: (i) extract clinically informative respiratory indicators from regular breathing records, and (ii) identify cough, breath and voice segments in crowd-sourced audio recordings. A new framework utilizing the smarty4covid OWL knowledge base towards generating counterfactual explanations in opaque AI-based COVID-19 risk detection models is proposed and validated.


Background & Summary
The COVID-19 pandemic induced innovation in many technological sectors, leading to the development of a variety of means to combat the global outbreak such as vaccines, bio-sensors facilitating diagnosis at the point of care, 3D printed ventilators and a wealth of mobile applications. More specifically, leveraging the latest trends in mobile health technologies, several applications have been implemented to fight COVID-19 with the aim of creating awareness, collecting suitable data for health survey and surveillance, reducing person-to-person contacts, offering telemedicine services, tracking COVID-19 contacts, supporting healthcare professionals in decision making, facilitating communication and collaboration among healthcare providers and serving as a means towards coordinating emergency response and transport 1. Further to the above, Artificial Intelligence (AI) and Machine Learning (ML) have played an important role in the response to the COVID-19 related challenges by accelerating research and treatment while offering remarkable solutions to diagnosis, taking into consideration several biomedical data such as X-rays, Computer Tomography (CT) scans, electrocardiograms and audio recordings 2,3. Furthermore, ML has demonstrated promising performance in epidemiological modelling based on social and weather data 4.
A prompt diagnosis of newly infected cases is of particular importance. However, RT-PCR tests and CT scans suffer from certain limitations such as variable sensitivity and increased turnaround time, while requiring highly trained staff, approved laboratories and expensive equipment. Antigen tests constitute an alternative; nevertheless, they demonstrate poor sensitivity 5. An m-health approach able to support affordable, fast, sustainable and effective testing, facilitating multiple repetitions to track progression, could contribute to containing the spread and suppressing resurgence 6. Within this context, the idea of harnessing the power of AI coupled with mobile technologies to implement an easy-to-use and widely accessible COVID-19 detection method has motivated the application of signal analysis and AI on audio recordings of cough, voice and breath towards the detection of innovative COVID-19 related bio-markers 5,7.
In recent literature, most approaches for predicting the COVID-19 risk from audio recordings rely on deep learning models, which typically require large amounts of training data. Therefore, the development of curated COVID-19 audio datasets is crucial for achieving accuracy and reliability 7. Several studies have collected audio recordings from citizens following a crowd-sourcing approach through the use of a web interface. The first attempt in this direction was initiated within the frame of the COVID-19 Sounds project 8. The COVID-19 Sounds database consists of 53,449 audio samples, each including 3 to 5 deep breaths through the mouth, 3 voluntary coughs and 3 voice repetitions of a predefined short sentence. Coswara is another crowd-sourced database consisting of various kinds of sounds such as breaths (shallow and deep), voluntary coughs (heavy and shallow), sustained vowel phonation (/ey/ as in made, /i/ as in beet, /u:/ as in cool), and number counting from one to twenty (normal and fast-paced) 9. Coughvid is also considered to be one of the largest crowd-sourced databases, yet it includes only cough sounds 10. To date, the latest version of Coughvid is publicly released with 27,550 cough recordings. As illustrated in Fig. 1, the number of the obtained audio samples (each audio sample includes all the considered types of audio recordings) ranges from 2,030 to 53,449, while the prevalence of COVID-19 cases is relatively low, especially in the Coughvid and the COVID-19 Sounds datasets. All these available databases include various demographics, symptoms, and co-morbidities in order to provide further information towards detecting COVID-19.
A common pitfall of crowd-sourced data is that it contains audio recordings unrelated to the desired content of the database and audio recordings characterized by low quality and increased noise. This highlights the need to apply methods for data curation. Within the frame of the Coughvid and the COVID-19 Sounds projects, computational models have been developed to detect the specific segments in the audio signal that contain the considered audio recording. More specifically, the YAMNet pre-trained audio classification network has been used to filter out noisy, silent, low-quality, and inconsistent recordings 8 in the dataset. The model has been evaluated on a small subset (i.e. 3,067 audio recordings) that has been manually annotated and has achieved an accuracy up to 88%. In the case of the Coughvid dataset, a small number (i.e. 215) of audio recordings have been selected and manually annotated as cough or non-cough sounds. This small dataset has been used to develop an eXtreme Gradient Boosting classifier towards discriminating cough from non-cough audio recordings, taking as input 68 audio features in the domains of (i) Mel Frequency, (ii) Time, and (iii) Frequency. Following a 10-fold cross validation framework, the COUGHVID model has achieved sensitivity and c-statistic equal to 78.2% and 96.4%, respectively 10. The Coswara dataset has been entirely manually annotated.
Fig. 1 The smarty4covid contribution.Each sample is considered to contain audio recordings of all sound types that are collected within the frame of each crowd-sourcing approach.
The development of robust machine learning models able to detect COVID-19 is particularly challenging due to the heterogeneity of the available datasets, the low number of cases (positive for COVID-19) versus controls (negative for COVID-19), and deficiencies related to COVID-19 variants and factors that strongly affect the infection, for example the vaccination status against COVID-19. On top of this, there are biases in the available datasets that need to be thoroughly investigated, while there is an increased risk of model over-fitting, especially when complex modelling strategies are applied 6. The realistic performance of audio-based digital testing for COVID-19 has been explored through artificially creating biases in the development dataset, for example introducing gender bias into the data by selecting a high percentage of cases as males, and evaluating their impact on the model's efficacy 6. Another research challenge is the development of a knowledge representation of the available data/information that enables data consolidation and reasoning. The latter is particularly important in order to ensure transparency and gain end-users' trust through providing explanations of the estimated risk. From this perspective, the deployment of smart interfaces that present end users with human-understandable interpretations and explanations of their estimated COVID-19 probability can greatly support informed decision making while enhancing human supervision towards the realization of a human-centered AI approach. The development of responsible AI models requires data that is richly annotated with metadata, expert labels, and semantic information. This additional information can be used as high-level features for training explainable AI models, since these features are more understandable for humans than, for example, audio signals or spectrograms that usually form the input space of deep learning models. Furthermore, this additional information can be utilized for post hoc explainability and analysis of black-box classifiers, which is particularly useful since opaque deep learning models are usually applied towards detecting COVID-19 from audio recordings [11][12][13].
The smarty4covid project aspires to create an intelligent multimodal framework for COVID-19 risk assessment and monitoring based on Explainable Deep Learning. Following the necessary approvals from the National Technical University's Ethics Committee of Research, a responsive web-based application (www.smarty4covid.org) has been implemented and publicly released as a means of data collection. The smarty4covid dataset 14 contains in total 18,265 audio recordings of cough, breath (regular, deep) and voice corresponding to 4,673 users (Greek and Cypriot citizens). It also includes other self-reported information related to demographics, symptoms, underlying conditions, smoking status, vital signs, COVID-19 vaccination status, hospitalization, emotional state, working conditions and COVID-19 status (i.e. positive, negative, not-tested). The entire dataset has been cleaned of erroneous and noisy samples, and a subset of the dataset (i.e. 1,475 samples) has been labeled by medical experts. Furthermore, all available information has been encoded into an innovative web ontology language (OWL) knowledge base that also contains a rudimentary hierarchy of concepts. The medically related concepts in the OWL knowledge base are provided in the form of ids from SNOMED-CT 15.
The curated crowd-sourced smarty4covid dataset 14 is publicly released, yet all audio records of voices that are considered personal data according to the GDPR regulation are excluded (Fig. 1). The smarty4covid OWL knowledge base is also made available in order to enable data consolidation from multiple databases. The smarty4covid OWL knowledge base offers an interpretable framework of high expressiveness which can be employed to explain complex machine learning models through identifying semantic queries over the knowledge that mimic the model 16. The smarty4covid dataset 14 has been utilized towards the development of models able to: (i) classify segments of audio signals as "cough", "breath", "voice", and "other", and (ii) detect inhalation and exhalation segments from breathing recordings, which can be used for extracting clinically related features such as respiratory rate (RR), inhalation to exhalation ratio (I/E ratio), and fractional inspiration time (FIT). The smarty4covid OWL knowledge base has been validated as a means of generating counterfactual explanations and discovering potential biases in the available datasets.

Methods
The overall approach towards the development of the smarty4covid database is depicted in Fig. 2. It includes a crowd-sourcing data collection strategy followed by a two-step data curation method involving data cleaning and labeling. A multi-modal dataset was collected including audio records and tabular data. The curated dataset was exploited for extracting breathing related features, creating publicly available data records, and developing the smarty4covid OWL knowledge base that enables data selection and reasoning.
Crowd-sourcing Data Collection. The smarty4covid crowd-sourcing data collection was approved by the National Technical University's Ethics Committee of Research (16141/15.04.2020) and complied with all relevant ethical regulations. A responsive and user-friendly web-based application (www.smarty4covid.org) was implemented targeting Greek and Cypriot citizens older than 18 years. The smarty4covid questionnaire consisted of several sections accompanied by instructions for users to perform audio recordings of voice, breath and cough and provide information regarding demographics, COVID-19 vaccination status, medical history, vital signs as measured by means of oximeter and blood pressure monitor, COVID-19 symptoms, smoking habits, hospitalization, emotional state and working conditions. Four types of audio recordings were considered: (i) three voice recordings where the user was required to read a specific sentence, (ii) five deep breaths, (iii) 30 s of regular breathing close to the microphone of the device and (iv) three voluntary coughs. A framework safeguarding data protection while taking into consideration all the necessary ethical aspects was implemented. The user terms and privacy policy were appropriately drafted, clarifying the data usage and conditions for sharing, the users' rights and the exact measures taken to protect the data. Prior to initiating the smarty4covid questionnaire, the users were required to read the informed consent, which included the links to the user terms and privacy policy, in order to provide their consent. Following effective media planning, more than 10,000 individuals provided demographic information and underlying medical conditions to the smarty4covid application, yet almost half of them (i.e. 4,679) gave the necessary permissions to perform the audio recordings. The web-based application was released in January 2022 during the spread of the omicron wave in Greece, resulting in high COVID-19 prevalence (17.3% of users tested positive for COVID-19).
Data Curation. Part of the crowd-sourced dataset was invalid due to erroneous audio recording submissions by the users and the presence of distortions and high background noise. The data cleaning process was performed by means of a crowd-sourcing campaign utilizing the Label Studio (https://labelstud.io/) open source data labeling tool. AI engineers who volunteered to annotate the audio signals signed a Non Disclosure Agreement (NDA) and were granted the necessary access permissions. A user-friendly environment was implemented enabling the annotators to listen to the audio signals and answer questions regarding their validity (yes/no) and their quality (Good, Acceptable, Poor) in terms of background noise and distortion. In order to evaluate the quality of the annotations, a set of randomly selected audio files (i.e. 1,389) was considered more than once and up to 5 times in the annotation procedure. A high level of consistency (92.5%) among the annotators was observed, indicating that there was no need to have multiple annotators for each audio recording.
The smarty4covid crowd-sourced dataset was enriched with labels annotated by healthcare professionals (pulmonologists, anesthesiologists, internists) who volunteered to characterize the collected audio recordings in terms of audible abnormalities and to provide personalized recommendations regarding the need for medical advice. To this end, four crowd-sourcing campaigns were initiated utilizing Label Studio. Three campaigns focused on the audio recordings (breath, voice, cough). As depicted in Fig. 3, the healthcare professionals were asked to assess the presence of audible abnormalities by selecting one or more options from the available labels. In the fourth campaign, the healthcare professionals were exposed to all available multimodal information about the user, excluding vital signs (oxygen saturation, beats per minute (BPM), diastolic/systolic pressure) that would lead them to a biased assessment, in order to estimate the risk of health deterioration and suggest a next course of action: a) Seek medical advice, b) Repeat the Smarty4Covid test in 24 hours and c) In case you notice changes in your health status, repeat the Smarty4Covid test. They were also asked to define a level of confidence (from 1 to 10) in their assessment.
Breathing Feature Extraction. Respiration is a complex physiological process, involving both voluntary and involuntary processes, as well as underlying reflexes. A breathing pattern is the result of fine coordination between peripheral chemoreceptors, the organizing structures of the central nervous system, lung mechanoreceptors and parenchyma, musculoskeletal components, intrinsic metabolic rate, emotional state, and many others. The breathing pattern adopted at any given moment is assumed to be the one that produces adequate alveolar ventilation at the lowest possible energy cost, given the current mechanical status of the system and the metabolic needs of the organism. Any disruption in any of these pillars of respiratory homeostasis will be reflected in a change of the respiratory pattern, shifting this balance to the best energetic state for the prevailing conditions 17. A viral infection could be a factor that disrupts the breathing pattern [18][19][20][21]. Some quantitative indicators commonly used to describe a breathing pattern and its readjustments are RR, respiratory phases and volumes, partial pressure of gases, blood gas analysis and others 17,22.
Fig. 2 Overall approach towards developing the smarty4covid database.
Most of the studies associated with COVID-19 crowd-sourced databases of breathing audio recordings explore features generated through signal processing or deep learning. The smarty4covid dataset 14 advances the current state of the art by including clinically relevant and informative respiratory indicators extracted from regular breathing records, such as the RR, I/E ratio, and FIT. RR is the number of breaths per minute, which is normally 16-20 breaths/min. It can be affected by both external and internal factors such as the temperature, endogenous acid-base balance, metabolic state, diseases, injuries, toxicity, etc. I/E ratio is the ratio between the inspiratory time (T_i) and the expiratory time (T_e) and it can be indicative of a flow disturbance in the respiratory tract 23. Normal breathing usually presents a 1:2 or 1:3 I/E ratio at rest 23, while airway obstruction may lead to prolonged expiration or inspiration, resulting in an abnormal I/E ratio. FIT, also termed the inspiratory "duty cycle" of the respiratory system, is the ratio between T_i and the duration of a total respiratory cycle (T_tot) 22. It provides a rough measure of airway obstruction and stress on the respiratory muscles. Table 1 summarizes the description and the normal ranges of the aforementioned respiratory indicators.
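The three indicators defined above follow directly from the phase durations. The sketch below is illustrative only and is not the authors' implementation: the function name is hypothetical, and it assumes that each respiratory cycle consists solely of one inhalation and one exhalation (T_tot = T_i + T_e, with no pause), averaging the durations over the recording.

```python
def respiratory_indicators(inhale_durations, exhale_durations):
    """Compute RR, I/E ratio and FIT from per-breath phase durations (seconds).

    Assumption (not stated in the source): T_tot = T_i + T_e, i.e. pauses
    between phases are ignored, and durations are averaged over all breaths.
    """
    if not inhale_durations or len(inhale_durations) != len(exhale_durations):
        raise ValueError("need one inhalation and one exhalation per breath")
    n = len(inhale_durations)
    t_i = sum(inhale_durations) / n      # mean inspiratory time
    t_e = sum(exhale_durations) / n      # mean expiratory time
    t_tot = t_i + t_e                    # mean duration of a full cycle
    return {
        "RR": 60.0 / t_tot,              # breaths per minute
        "IE_ratio": t_i / t_e,           # 0.5 corresponds to 1:2
        "FIT": t_i / t_tot,              # inspiratory duty cycle
    }
```

For example, breaths with 1 s of inhalation and 2 s of exhalation give an I/E ratio of 1:2, a FIT of one third and an RR of 20 breaths/min.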
A two-step approach was developed in order to extract T_i and T_e from the crowd-sourced breathing audio signals: (i) localization of the segments of the audio signal that contain breathing, and (ii) detection of the exhaling and inhaling parts. In the first step, an AI-based model, described in the "Technical Validation" Section, was applied. The obtained breathing segments were split into non-silent intervals. The second step was particularly challenging because either the inhalation part, which was characterized by low mean amplitude, was not appropriately captured due to the hardware of the recording device, or the short distance of the sound source from the microphone during the exhalation phase distorted the waveform. In order to face this challenge, an unsupervised method was developed with the aim of identifying similar parts of a single breathing audio signal that in turn could be considered as either inhalation or exhalation. This particular method presents several advantages over the state of the art 24, since it does not require a dataset of human-labeled data for training, and there is no need to take into consideration prior knowledge that inhalation follows exhalation and vice versa. Furthermore, the application of the unsupervised method on a single breathing audio signal adds robustness against distortion and background noise, since all inhalation/exhalation parts of the same breathing recording are subject to the same level of distortion and background noise.
The unsupervised method featured a clustering algorithm based on affinity propagation 24 at a frequency level. To this end, the mel-spectrogram (MFCC-128) of the audio signal was obtained and transformed into a vector of 128 frequencies, each one corresponding to the summation of the respective frequencies over time. The obtained clusters were labeled as "inhalation", "exhalation" or "other" through applying a heuristic approach. More specifically, for each cluster, the mean amplitude was calculated by averaging the mean amplitudes over all the members of the cluster. Next, the clusters were sorted from largest to smallest mean amplitude. The top listed cluster was considered as exhalation, while the second cluster (if it existed) was considered as inhalation. The remaining clusters were labeled as "other".
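The labeling heuristic described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the clustering step has already assigned a cluster id to each non-silent interval, and the function and argument names are hypothetical.

```python
def label_clusters(cluster_ids, mean_amplitudes):
    """Label each interval as exhalation / inhalation / other.

    cluster_ids[i]      -- cluster assigned to interval i (any hashable id)
    mean_amplitudes[i]  -- mean amplitude of interval i
    """
    # Accumulate (sum, count) of member amplitudes per cluster.
    totals = {}
    for cid, amp in zip(cluster_ids, mean_amplitudes):
        s, c = totals.get(cid, (0.0, 0))
        totals[cid] = (s + amp, c + 1)

    # Sort clusters from largest to smallest mean amplitude.
    ranked = sorted(totals, key=lambda cid: totals[cid][0] / totals[cid][1],
                    reverse=True)

    # Loudest cluster -> exhalation, second (if any) -> inhalation, rest -> other.
    names = {}
    for rank, cid in enumerate(ranked):
        if rank == 0:
            names[cid] = "exhalation"
        elif rank == 1:
            names[cid] = "inhalation"
        else:
            names[cid] = "other"
    return [names[cid] for cid in cluster_ids]
```

Because the ranking is computed per recording, the labels depend only on relative loudness within that recording, which matches the robustness argument made for the single-signal approach.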
For validation purposes, the inhalation and exhalation parts of 127 audio recordings of regular breathing were manually annotated in order to enable the calculation of the corresponding respiratory indicators. The proposed unsupervised method achieved Root Mean Square Error (RMSE) up to 1.77, 0.21 and 0.08 for the RR, FIT, and I/E ratio, respectively. The RMSE values are considered to be low taking into consideration the normal ranges of each respiratory indicator (Table 1). The algorithm's efficacy in accurately identifying the inhalation and exhalation parts within the audio recordings of regular breathing was assessed by applying the Intersection over Union (IoU) criterion 25. The obtained IoU values were up to 75% for inhalation and 76% for exhalation. These results indicate an acceptable degree of alignment between the algorithm's output and the actual respiratory phases.
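Both validation criteria are simple to compute once predicted and annotated phases are available as time intervals. The sketch below is one possible formulation under stated assumptions: the source does not specify whether IoU is computed per segment or over the whole recording, so here it is computed over the total duration of all intervals of one phase, and the intervals within each list are assumed not to overlap each other.

```python
import math


def interval_iou(predicted, annotated):
    """IoU between two lists of (start, end) intervals in seconds.

    Assumption: intervals inside each list do not overlap one another.
    """
    def total(intervals):
        return sum(end - start for start, end in intervals)

    # Summed pairwise overlap between predicted and annotated intervals.
    intersection = sum(
        max(0.0, min(p_end, a_end) - max(p_start, a_start))
        for p_start, p_end in predicted
        for a_start, a_end in annotated
    )
    union = total(predicted) + total(annotated) - intersection
    return intersection / union if union > 0 else 0.0


def rmse(predicted, actual):
    """Root Mean Square Error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )
```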

Data Records
Part of the smarty4covid crowd-sourced dataset (4,303 submissions) was organized into data records in order to be publicly available. The data records are deposited in the Zenodo Repository 14. As depicted in Fig. 4, each directory contains the submissions of a specific user. The user directory is named after the user's id, which is generated according to the UUID V4 protocol. Apart from the submissions, a json file ("demographics_underlying_conditions.json") with information regarding demographics (BMI, age group, gender) and potential underlying conditions (Table 2) is also included. Each submission corresponds to a separate sub-directory that is named after the unique submission id and contains:
1. valid audio recordings of cough ("audio.cough.mp3"), deep breathing ("audio.breath_deep.mp3") and regular breathing ("audio.breath_regular.mp3"). Each audio recording has a sampling rate of 48 kHz and a bitrate of 64 kb/s.
2. a json file ("main_questionnaire.json") with information related to the COVID-19 test (result, type, and date), COVID-19 vaccination status, COVID-19 related symptoms, vital signs and more (Table 3).
3. a json file ("breathing_features.json") with the extracted respiratory indicators and the manual annotations of the breathing phases (inhalation, exhalation) on the breathing audio signal (Table 4).
4. four json files ("experts.breath.json", "experts.cough.json", "experts.medical_advice.json", "experts.voice.json") including the input/labels (characterization, advice) from the healthcare professionals (Tables 5-7).
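Given this layout, one user directory can be read with the standard library alone. The following is a minimal sketch, assuming the directory structure and file names listed above; the function name and the shape of the returned dictionary are hypothetical, and no assumption is made about the keys inside the json files.

```python
import json
from pathlib import Path


def load_user(user_dir):
    """Collect the records found under one user directory.

    Files that are absent (e.g. an excluded recording) are simply skipped.
    """
    user_dir = Path(user_dir)

    def read_json(path):
        return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None

    record = {
        "user_id": user_dir.name,
        "demographics": read_json(user_dir / "demographics_underlying_conditions.json"),
        "submissions": [],
    }
    # Every sub-directory is one submission, named after its submission id.
    for sub in sorted(p for p in user_dir.iterdir() if p.is_dir()):
        record["submissions"].append({
            "submission_id": sub.name,
            "questionnaire": read_json(sub / "main_questionnaire.json"),
            "breathing_features": read_json(sub / "breathing_features.json"),
            "audio": sorted(p.name for p in sub.glob("audio.*.mp3")),
        })
    return record
```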
Knowledge Base.A web-ontology language (OWL) knowledge base (https://www.w3.org/OWL/) was developed motivated by the need of data consolidation from different relevant databases (Coughvid, COVID-19 sounds, Coswara) and the application of complex queries for the detection of users with specific characteristics.All available information resulting from the crowd-sourcing, data cleaning and data labeling procedures were also released in the form of the smarty4covid OWL knowledge base.The smarty4covid OWL knowledge base is hosted on the same Zenodo Repository as the data records 14 .In general, using a vocabulary  Based on these axioms, the hierarchies of concepts and roles can be defined in the TBox.
In the smarty4covid OWL knowledge base, the set of individual names (IN) contains a unique name indicative of each participant, questionnaire, audio file, healthcare professional that participated in the labeling procedure and the corresponding characterizations of the audio records. IN also includes unique names for each declared symptom, COVID-19 test and preexisting condition that is linked to the corresponding questionnaire (e.g. symptom, COVID-19 test) and participant (e.g. underlying condition), respectively. These individuals are linked through appropriately defined roles. The role names RN and their defined hierarchy are depicted in Fig. 5. Each role is associated with a domain and a range indicative of the types of the individuals that can be linked through this role. In particular, the role hasCharacterization links audio files to characterizations as labelled by the healthcare professionals, and characterizedBy links characterizations to instances of the healthcare professionals. The role hasAudio and its children link questionnaires to audio files. The roles hasCovidTest and hasSymptom link questionnaires to instances of COVID-19 tests, self-reported symptoms, and vaccination status, respectively. The role hasPreexistingCondition links participants to preexisting conditions, while hasUserInstance links participants to their submitted questionnaires.
The set of concept names CN involves concepts that describe instances of audio, COVID-19 tests, preexisting conditions, symptoms, users and questionnaires. For audio related concepts, their hierarchy is shown in Fig. 5a. Specifically, there is a concept for each type of audio recording (regular breathing, deep breathing, voice, cough), and concepts regarding the audio quality. Audio instances can additionally be linked, via the hasCharacterization role, to audible abnormalities, for which the hierarchy of concepts is shown in Fig. 5h. Similarly, all preexisting conditions that appear in the questionnaire are organized as concepts in a hierarchy as shown in Fig. 5d, and all symptoms are part of the symptom hierarchy, shown in Fig. 5c. Furthermore, the User concept subsumes concepts related to the different age and gender of the participants, as shown in Fig. 5e, while the UserInstance concept, which corresponds to a specific questionnaire submitted by a user, also subsumes a hierarchy based on the different possible answers in the questionnaire, shown in Fig. 5b. Finally, the concepts related to COVID-19 tests, shown in Fig. 5g, are used to define the type of test and its outcome.
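To illustrate how the roles and concepts described above support queries over users, the sketch below stores a handful of assertions as plain (subject, role, object) tuples and selects participants by preexisting condition and test type. It is a toy stand-in for an OWL reasoner or SPARQL endpoint: the role names hasUserInstance, hasCovidTest and hasPreexistingCondition come from the text, whereas the individual names, the concept names "Asthma" and "PositivePCRTest", and the "type" relation are hypothetical placeholders, and no concept hierarchy is taken into account.

```python
# Hypothetical assertions; real individual and concept names differ.
TRIPLES = [
    ("user_1", "hasUserInstance", "questionnaire_1"),
    ("questionnaire_1", "hasCovidTest", "test_1"),
    ("test_1", "type", "PositivePCRTest"),
    ("user_1", "hasPreexistingCondition", "condition_1"),
    ("condition_1", "type", "Asthma"),
    ("user_2", "hasUserInstance", "questionnaire_2"),
    ("questionnaire_2", "hasCovidTest", "test_2"),
    ("test_2", "type", "PositivePCRTest"),
]


def objects(triples, subject, role):
    """All objects linked to `subject` through `role`."""
    return [o for s, r, o in triples if s == subject and r == role]


def is_a(triples, individual, concept):
    """True if `individual` is directly asserted to be a `concept`."""
    return concept in objects(triples, individual, "type")


def users_with(triples, condition, test_concept):
    """Participants with the given condition and at least one matching test."""
    users = sorted({s for s, r, _ in triples if r == "hasUserInstance"})
    hits = []
    for user in users:
        has_condition = any(
            is_a(triples, c, condition)
            for c in objects(triples, user, "hasPreexistingCondition")
        )
        has_test = any(
            is_a(triples, t, test_concept)
            for q in objects(triples, user, "hasUserInstance")
            for t in objects(triples, q, "hasCovidTest")
        )
        if has_condition and has_test:
            hits.append(user)
    return hits
```

With the sample assertions, `users_with(TRIPLES, "Asthma", "PositivePCRTest")` selects only `user_1`, since `user_2` has a matching test but no asserted condition.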
The described hierarchies of concepts and roles are provided in OWL format in the file [smarty-ontology.owl].Using this terminology, all information presented in the dataset 14 is asserted in the form of triples, provided in the file [smarty-triples.nt].An example of a smarty4covid user is depicted in Fig. 6.This user who is a female (20-30 years old) and has asthma, has submitted a questionnaire declaring a positive PCR test and

Technical Validation
Inferences from statistical analysis.The representativeness in the smarty4covid dataset 14 was explored in terms of demographics, symptoms, vaccination status, COVID-19 prevalence and level of anxiety.The distribution of gender, age and COVID-19 test results is depicted in Fig. 6.A higher percentage (61.0%) of male versus female users was observed, yet a wide range of ages was present.Most of the users' ages were between 30 to 59    7c.
Figure 8 illustrates the presence of underlying medical conditions associated with the progress of COVID-19 in the smarty4covid dataset 14 .More than 1 out of 4 users (27%) reported at least one underlying medical condition while hypertension was the most commonly reported condition (Fig. 8).The distribution of the underlying medical conditions was similar to the one published by Eurostat 26 that considered the general population in Greece.
Referring to the COVID-19 related symptoms, more than half of the users reported at least one symptom. Figure 9 depicts the frequency of each symptom versus vaccination status (not vaccinated, fully vaccinated and booster dose). It can be inferred that users with a booster dose presented fewer symptoms than those who were not vaccinated. Figure 9 illustrates the percentages of users who were positive and negative for COVID-19 for each vaccination status. It can be seen that the COVID-19 prevalence is lower within the booster dose vaccinated population. The smarty4covid dataset 14 also included vital signs (oxygen saturation, beats per minute (BPM), diastolic/systolic pressure) as measured by means of relevant devices and self-reported COVID-19 related anxiety level. A box plot of the oxygen saturation for different age groups (Fig. 10) shows that oxygen saturation decreases with age. Figure 10 depicts the vaccination status versus anxiety. Higher levels of anxiety were associated with a higher percentage of users vaccinated with a booster dose.
Training AI models for classification of audio types. An AI-based model for classifying audio segments into cough, voice and breathing was developed utilizing the smarty4covid dataset 14 in order to: (i) validate the quality of the smarty4covid dataset towards training an AI model with generalization capabilities, (ii) support the automated cleaning of crowd-sourced audio recordings, and (iii) be integrated in relevant crowd-sourcing platforms for detecting whether the submitted audio recording is valid and, if needed, prompting the users to repeat the audio recording. The development of the model was based on the smarty4covid audio recordings along with their respective annotations regarding their validity and quality (i.e. data cleaning procedure). More specifically, the dataset included 2,647 cough recordings, 2,080 breathing recordings, and 2,593 voice recordings with duration greater than 10 s and Acceptable or Good quality.
Architecture.As depicted in Fig. 11, the classifier was based on the combined use of 2D Convolutional Neural Networks (CNN) that received as input the Mel spectrograms of audio segments of a specific duration (d) and output the probability of detecting cough, breath, and voice.The frequency axis of the Mel spectrograms had size equal to 128, while the size of the time axis (d) was a hyperparameter which was tuned through applying a grid search from 128 to 1024 corresponding to approximately 1 to 10 s of audio, respectively.Each CNN consisted of b stacked blocks containing l convolutional layers followed by a 2 × 2 max pooling layer and a dropout layer with the dropout probability set to its default value equal to 0.5.The convolutional layers of each block featured k 3 × 3 relu activated kernels and applied identical padding in order to ensure that the output of each layer had      applying a randomized selection of a labeled (cough, voice, breathing) segment of a width d.The CNN's training procedure aimed at driving the optimization of the categorical cross entropy loss through the Adam algorithm 27 .During inference, a sliding window of length d and step 1 was used to extract all (overlapping) segments of the audio signal, which were then fed to the trained CNN that estimated the probabilities of detecting cough, voice and breathing.Following this approach, the classification of an entire audio signal was also feasible, by combining (i.e.averaging) the estimated probabilities over all extracted segments.
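The inference procedure described above (a sliding window of length d and step 1, followed by averaging) can be sketched independently of the network itself. In the illustration below the trained CNN is replaced by a hypothetical callable `predict` that maps one segment to a list of class probabilities; the function name and the representation of the spectrogram as a list of time frames are assumptions made for the example.

```python
def classify_recording(frames, d, predict, step=1):
    """Average class probabilities over all overlapping windows of d frames.

    frames  -- sequence of spectrogram time frames (one entry per time step)
    d       -- window length in frames
    predict -- hypothetical stand-in for the trained CNN: segment -> probabilities
    """
    if len(frames) < d:
        raise ValueError("recording is shorter than the window length")
    starts = range(0, len(frames) - d + 1, step)
    sums = None
    for start in starts:
        probs = predict(frames[start:start + d])
        if sums is None:
            sums = list(probs)
        else:
            sums = [acc + p for acc, p in zip(sums, probs)]
    # Soft combination: mean probability per class over all windows.
    return [acc / len(starts) for acc in sums]
```

The predicted class of the whole recording is then the index of the largest averaged probability.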
Results and external evaluation.Aiming at exploring the impact of width (d) on the model's performance, a low (i.e.128) and a high (i.e.1024) value was applied resulting in two classifiers operating in short (1 s) and long (10 s) time scale, respectively.In order to evaluate the generalization capabilities of the classifiers, the Coswara dataset 9 served as external validation dataset since it includes all the three types of the considered audio recordings.Table 8 presents the confusion matrix of the obtained results.The long time scale classifier achieved a slightly better discrimination performance than the one obtained by applying the short time scale classifier (accuracy = 95.3%vs 94%, c-statistic = . 0 995 vs 0 992 ., macro F1 score = .0 953 vs 0 941 .).Leveraging upon the proposed architecture's flexibility, a multi-scale classifier was developed as an ensemble of the short and long time scale classifiers by applying a soft combination scheme (i.e.averaging) on the primary output probabilities.The obtained confusion matrix (Table 8) indicated that the multiscale classifier had the highest sensitivity in detecting cough and breathing and the lowest one in detecting voice, yet the difference among the classifiers' performances was small.The multiscale classifier's effectiveness was further assessed on a subset of the COUGHVID dataset 9 that included annotations from experts in terms of audio quality (i.e.good, ok, poor, no cough).The subset contained 2,890 audio recordings in total and only few of them (i.e.74) were annotated as not recording a cough event.Table 9 presents the obtained results in a confusion matrix.It can be seen that the multiscale classifier achieved good performance, yet a small decrease in c-statistic compared with the one obtained on the Coswara dataset ( .0 995 versus 0 914 .) 
was observed, owing to the COUGHVID dataset containing only one audio type (i.e. cough) and few non-cough samples. The multiscale classifier's performance on the Coswara dataset was compared with that obtained by applying the COUGHVID classifier. The COUGHVID classifier is based on a pretrained XGBoost model and scaler and receives handcrafted features as input. A probability decision threshold equal to 0.8 is suggested by its creators.
The obtained modified smarty4covid OWL knowledge is subject to conceptual edits, which apply alterations on the concepts in order to identify the minimum changes that result in switching the estimated classification to a desired class. A thorough description of utilizing conceptual edits as counterfactual explanations is presented in 28. Figure 13 illustrates two examples of identifying the minimal conceptual edits in order for a positive COVID-19 user to become negative. The global counterfactual explanations are obtained by adding the minimal concept edits over all users.
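The idea of minimal conceptual edits and their aggregation into a global explanation can be sketched in Python under strong simplifying assumptions. Here each user is reduced to a flat set of concept names, the cost of an edit is the number of concepts removed plus the number added, and the concept hierarchy of the knowledge base is ignored. The method described in 28 is richer than this. All function names and the toy individuals below are hypothetical and are not drawn from the smarty4covid data.

```python
from collections import Counter

def concept_edits(source, target):
    """Edits that turn concept set `source` into `target`: (removed, added)."""
    return frozenset(source - target), frozenset(target - source)

def edit_cost(source, target):
    """Simplified cost: number of removed plus number of added concepts."""
    removed, added = concept_edits(source, target)
    return len(removed) + len(added)

def minimal_edits(user, candidates):
    """Edits towards the closest candidate (smallest simplified cost)."""
    closest = min(candidates, key=lambda c: edit_cost(user, c))
    return concept_edits(user, closest)

def global_explanation(positives, negatives):
    """Sum the minimal edits of every predicted-positive user."""
    counts = Counter()
    for user in positives:
        removed, added = minimal_edits(user, negatives)
        counts.update(("remove", c) for c in removed)
        counts.update(("add", c) for c in added)
    return counts

# Hypothetical toy individuals, each described by a flat set of concepts.
positives = [
    {"Female", "Headache"},
    {"Female", "Cough"},
]
negatives = [
    {"Male", "Headache"},
    {"Male", "Cough", "Smoker"},
]
explanation = global_explanation(positives, negatives)
```

Calling `explanation.most_common()` ranks the edits by how often they appear across users, which is the sense in which a single concept can emerge as the dominant factor in a global explanation.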
In order to validate the aforementioned framework, a COVID-19 classifier (https://github.com/kinezodin/ntuautn) was developed and potential biases were explored, taking into consideration the Coswara dataset 9 as development dataset and the smarty4covid dataset 14 as explanation dataset. Audio recordings of cough were fed into the COVID-19 classifier in order for the latter to produce the COVID-19 probability. The development of the COVID-19 classifier was based on ensembles of CNNs that received as input segments of the cough audio signal's Mel spectrogram with a specific duration. The overall architecture comprised three 2D CNN layers, followed by a fully connected layer for generating the COVID-19 probability. Utilizing the Coswara dataset 9 as development dataset, a 5-fold evaluation strategy was conducted that resulted in a c-statistic up to 0.764 ± 0.038. However, the trained COVID-19 classifier had a low performance when applied on the smarty4covid dataset (c-statistic < 0.50), which indicated the presence of biases. Conceptual edits were applied on the modified smarty4covid OWL knowledge, including only those concepts that were directly expressible through a cough audio recording. For this reason, several concepts such as vaccination status, travel history, and pandemic-related anxiety were excluded. The obtained global explanations are presented in Fig. 14. Gender was found to be the most critical factor towards switching from positive for COVID-19 to negative. This bias needed further exploration to determine whether it originated in the Coswara dataset or in the COVID-19 classifier. The application of basic statistics (χ² test) on the Coswara dataset revealed that the COVID-19 prevalence in the male population was significantly higher than the one in the female population (Fig. 15). Statistically significant differences in COVID-19 prevalence were also revealed between different age groups (Fig. 15).
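A χ² test of this kind can be sketched with the Python standard library alone. The sketch below covers only the 2 × 2 case (two groups, positive versus negative) and uses Pearson's statistic without continuity correction. The text does not state which variant the authors applied, so this choice is an assumption. The counts in `placeholder_table` are arbitrary placeholders chosen to exercise the function and are not Coswara figures.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    table = [[a, b], [c, d]] with rows as groups and columns as outcomes
    (positive, negative). Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must have a non-zero total")
    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Arbitrary placeholder counts: rows are two groups, columns are
# (positive, negative). These are NOT values from the Coswara dataset.
placeholder_table = [[30, 70], [15, 85]]
statistic, p_value = chi2_2x2(placeholder_table)
```

Comparing prevalence across more than two age groups requires the general r × c form of the test with (r − 1)(c − 1) degrees of freedom, which this sketch does not implement.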

Usage Notes
Access to the data is subject to certain restrictions to ensure data privacy, security and responsible use. Access to the data is granted based on the data applicant's capacity and purpose of data usage. Allowed uses are restricted to research purposes and exclude commercial use for data re-distribution and targeted marketing, re-identification attempts, discrimination, hacking or unauthorized access to sensitive health information that may lead to identity theft, malicious purposes (e.g. phishing and social engineering, financial scams, misleading health claims), misuse for political or ideological purposes, and misrepresentation in terms of providing false information or manipulating the data material to create misleading or inaccurate research findings. Research by commercial organizations towards the development of new drug treatments, diagnostic measures, and medical devices is allowed. Data users should comply with all applicable laws, regulations, and data protection policies in their jurisdiction, as well as any additional terms and conditions set forth by the data providers. The procedure towards data application is initiated by requesting access, including personal information (i.e. full name and e-mail address) and the purpose of use. Following a positive review of the request, the data applicant will be required to read and agree with the terms of the Non-Disclosure and Material Transfer Agreement (https://tinyurl.com/bdf7x9sc) by submitting a signed copy to the data owner. Subsequently, a link will be provided to the data applicant to access the data.

Fig. 3 The smarty4covid labeling campaigns. The cough, breath, and voice campaigns include labels that are indicative of respiratory abnormalities.
RN, IN where CN, RN, IN are mutually disjoint sets of concept names, role names and individual names respectively, a knowledge base (K = ⟨A, T⟩) can be built through creating the Assertional Database (ABox, A) and the Terminology Database (TBox, T). The ABox includes assertions of the form C(a), r(a, b) where C ∈ CN, r ∈ RN,

Fig. 5 Hierarchies of concepts and roles from the smarty4covid knowledge base.

Fig. 6 Example of the structure of the smarty4covid knowledge base. Blue nodes represent individuals, and orange nodes concepts. Edges labeled as IsA represent concept assertions from the ABox, and subClassOf edges represent inclusion axioms from the TBox.

Fig. 8 Distribution of underlying medical conditions.

Fig. 11 Overview of the AI-based model for classifying audio segments into cough, voice, and breathing. Single-scale and multi-scale approaches are presented.
the model's output. As depicted in Fig. 12, it includes two different datasets: (i) the development dataset that is used to train an AI-based classifier, and (ii) the explanation dataset that is used to test the trained AI-based COVID-19 classifier. The trained AI-based COVID-19 classifier is applied on the explanation dataset and the estimated classifications feed the smarty4covid OWL knowledge base by replacing the actual classifications (i.e. COVID-19, non-COVID).

Fig. 13 Examples of applying conceptual edits on the smarty4covid OWL knowledge. User 1 and user 2 are the closest COVID-19 and non-COVID-19 neighbours in the smarty4covid OWL knowledge. The minimal concept edits towards switching from positive to negative include: (a) changing the gender from female to male, (b) changing the gender from female to male and the symptom from headache to cough.

Fig. 14 Global counterfactual explanations taking into consideration the Coswara dataset 9 as development dataset and the smarty4covid dataset 14 as explanation dataset.

Fig. 15 COVID-19 prevalence (a) between male and female population and (b) across different age groups in the Coswara dataset 9.

Table 1. Normal ranges of respiratory indicators.

Table 2. Demographics and underlying conditions JSON file description.
a headache while being a smoker. Her audio recording of cough has been labeled by medical professionals as featuring audible choking.

Table 4. Breathing features JSON file description.

Table 7. Experts' voice annotation JSON file description.

Table 8. Confusion matrices of the short, long and multi-time scale classifiers when evaluated on the Coswara dataset.
Table 10 presents the confusion matrix achieved by applying the COUGHVID classifier. The superiority of the multiscale classifier over the COUGHVID classifier was demonstrated through the evaluation metrics of accuracy ([...]).
Conceptual edits on the smarty4covid OWL knowledge to produce counterfactual explanations. Taking into consideration the increased demand for transparent AI, a framework that leverages the high expressiveness of the smarty4covid OWL knowledge base is proposed towards identifying potential biases in the COVID-19 classification models and the datasets used for their development. The framework utilizes counterfactual explanations that can provide meaningful information by generating the most influential factors affecting

Table 9. Confusion matrix of the multiscale classifier when evaluated on the COUGHVID dataset.

Table 10. Confusion matrix of the cough-detection model provided by COUGHVID when evaluated on the Coswara dataset.